Main topic of this data science project is to optimize the collection process. Sponsor of the project is Mr. Jiří Procházka, who decided on the scope of the project, provided data and helped define project deliverables.

First of all it was important to understand the collection process. Clients are paying certain amount due to the certain date. If the client delays his payment, certain collection procedure takes place because of the costs that rise for the company. There are three actions taking place under different conditions. If the payment is delayed for 21 - 69 days, first action is taken. Second action is taken if the delay is between 70 - 139 days and the third action is taken for delay greater than 140 days.

Our goal was to create three models. First model predicts whether the customer will be delayed in payment for 21+ days. Second model predicts whether the customer will be delayed in payment for 140+ days. Lastly, third model should estimate average number of days delayed if the client would exceed the first action. All three models were defined baseline accuracy, to which our models were compared.

As one of the deliverables, project documentation describes project flow. Project consisted of three major steps: Data preparation, Modeling, Evaluation. All of the phases are described in detail with visualized results and findings in this document.

We chose to use R studio to prepare the data, create models and evaluate, RMarkdown was used to create documentation of the project.

1 Data Preparation

1.1 Data Understanding

1.1.1 Data Description Report

The initial data was provided in a comma-separated values file, and was loaded and processed using the R programming language. Dataset used in this analysis contains 2 353 012 observations and 24 variables. Out of the 24 variables, 13 are of factor datatype, 9 are numeric and 2 are dates. All columns from the initial data were converted to the correct datype according to the data description file, which was provided. Column payed_ammount was replaced by column paid_amount. Column payment_date originally contained some blank fields, which were subsequently filled in as NA.

Column name Description Type Values
contract_id Unique identificator of the contract Int {1,2,3,…,N}
payment_order Order of the payment Int {1,2,3,…}
due_date Payment deadline Date YY/MM/DD
payment_date Date of the payment Date YY/MM/DD
product_type Type of the product Factor {1,2,3,4,5}
contract_status Contract status Factor {1,2,3,4,5,6,7,8,9}
business_discount Business discount provided Factor {0,1}
gender Gender Factor {1,2}
marital_status Marital status Factor {1,2,3,4,5,6}
number_of_children Number of children Int {1,2,3,…}
number_other_product Number of other products Int {1,2,3,…}
clients_phone T/F if the client filled in home phone Factor {True, False}
client_mobile T/F if the client filled in mobile phone Factor {True, False}
client_email T/F if the client filled in email address Factor {True, False}
total_earnings Earning bucket Factor {level1,…, not_declared}
birth_year Birth year of the client Int {1990,1991,…}
birth_month Birth month of the client Int {1,2,3,…}
living_area Region of the client home address Factor {1,2,3,…}
different_contact_area T/F if the client filled different home and contact address Factor {True, False}
kc_flag T/F if the client does not home local citizenship Factor {True, False}
cf_val If the special measure during the underwriting was applied Numeric {-N,…,N}
kzmz_flag T/F if the client filled in employer Factor {True, False}
due_amount Installment what should be payed Numeric (0,…)
payed_amount What was payed at a certain date Numeric (0,…)

1.1.2 Value ranges

Frequency, relative frequency and relative cumulative frequency were computed for each category in all categorical variables. All frequency tables can be located in the “frequencies” list.

1.1.3 Attribute correlations

We computed correlation coefficients between all possible pairs of numeric variables, see Figure ??, and discovered strong positive correlation between due amount and paid amount. This could be due to the fact that in the event that the installment has already been paid, the due amount and the paid amount would assume the same value. Correlation between the remaining pairs of numeric variables was either nonexistent or negligible. Then, the significance of correlation between due amount and paid amount was tested using Pearson’s product moment correlation coefficient. The pair of attributes was found to be significantly correlated with a correlation coefficient of 0.76 and p-value less than 2.2e-16.

Relationship between categorical variables was tested using chi-squared test with the significance level of 0.05. All significantly correlated pairs of variables can be accessed in “categorical_rel” dataframe. Results can be seen on Figure 1.1.

Correlation diagram

Figure 1.1: Correlation diagram

1.1.3.1 Basic statistics

Basic statistics computed for numeric variables can be located in Table ??. Distribution of numeric and categorical variables was visualized using boxplots, density plots and histograms, see in Attachments, Figure 5.1 and Figure 5.2. NEZOBRAZUJE TAB:

1.2 Data Exploration Report

NIECO HODIT

1.3 Data Quality Report

1.3.1 Data coverage

Next step consisted of the data coverage analysis.

As an example, we have chosen a couple of plots, that indicate interesting data distribution. We for example found out, that clients mostly order the product type 1, contracts are mostly in status number 1 or that most of the payments have a discount. We also discovered, that the marital status of the clients is mostly number 3 and they have most frequently no children. Clients also very frequently do not provide information about their earnings and they ordered usually 1 other product.

All the mentioned findings can be seen on visualizations on Figure 1.2. All the data coverage visualizations are available in the Attachments.

Distribution plots.

Figure 1.2: Distribution plots.

1.3.2 Missing values

Exploring the NA values in the dataset, we found out, that 4 attributes had almost same percentage of missing values, as can be seen on the statistics ???REFERENCIA INAK???. Attributes kc_flag, living_area, cf_val and different_contact_area have the most missing values, almost 20 %, whereas payment_order has around 3,5 % and payment_date and delay have the same percentage, almost 0,5 %.

Using a different visualization, that can be seen on Figure1.3 or Figure1.4, we discovered, that the four attributes with the highest percentage are not missing at random but almost all at the same time.

We found out, that contract_id together with payment_order were not creating a unique key of the data payment, but one payment was divided into multiple parts, which was also causing problem with NA values in the four attributes. Data in the four attributes were not copied into other parts of a payment, but were present in just the first payment part. We decided to unify the payment parts into only one payment by summarizing the paid amount of all the parts and using the payment_date of the last paid part. Thanks to the unification, the amount of NA values has markedly decreased.

Secondly, we dealt with the NA values in payment_order and payment_date. Since it was only less then 4 % of the dataset, and it was not possible to substitute the values, we decided to delete the rows.

## # A tibble: 25 x 3
##    variable               n_miss pct_miss
##    <chr>                   <int>    <dbl>
##  1 different_contact_area 471354   20.0  
##  2 cf_val                 469310   19.9  
##  3 living_area            469015   19.9  
##  4 kc_flag                468906   19.9  
##  5 payment_order           83361    3.54 
##  6 payment_date            11733    0.499
##  7 delay                   11733    0.499
##  8 contract_id                 0    0    
##  9 due_date                    0    0    
## 10 product_type                0    0    
## # ... with 15 more rows
Distribution of missing values.

Figure 1.3: Distribution of missing values.

Results can be seen on Figure 1.4.

Missing values distribution

Figure 1.4: Missing values distribution

1.4 Feature engineering

We decided to add new features to create higher-accuracy models. First of all, we computed a numerical feature “delay” counting the difference between payment_date and due_date.

Since we are creating two classification models deciding whether a new payment will be delayed for more than 21 days or more than 140 days, we created 2 new binary features “delay_21_y” and “delay_140_y”.

We also created a new numerical feature “delay_indiv” counting the mean delay for the whole client´s history. We also computed 2 new numerical features, “delay_indiv_21” and “delay_indiv_140” counting cumulative number of delayed payments (21, 140 days) by one contract

Lastly, numerical features “mean_delay_1m” ,“mean_delay_3m”, “mean_delay_6m”, “mean_delay_12m” are computing the mean delay for 1/3/6/12 months for each contract.

1.5 Exploratory analysis of the new features

1.5.1 Data description report

Column name Description Type
delay Difference between payment_date and due_date Int
delay_21_y T/F if the delay is more than 21 days Factor
delay_140_y T/F if the delay is more than 140 days Int
delay_indiv Mean delay for the whole client’s history Int
delay_indiv_21 Cumulative sum of payments delayed for more than 21 days by contract Int
delay_indiv_140 Cumulative sum of the payments delayed for more than 140 days by contract Int
mean_delay_1m Average payment delay for the last month Int
mean_delay_3m Average payment delay for the last 3 months Int
mean_delay_6m Average payment delay for the last 6 months Int
mean_delay_12m Average payment delay for the last 12 months Int

1.5.2 Basic statistics

We computed basic statistics for each new attribute. All the visualizations can be seen on @ref(fig:stat_new_delay).

First of all we focused on attribute delay. Since it is probably the most important attribute of the newly created, we analyzed its distribution, distribution of delay on log scale DOPLNIT CO SME NASLI.
Basic statistics of the added attribute delay.

(#fig:stat_new_delay-1)Basic statistics of the added attribute delay.

## [[1]]
Basic statistics of the added attribute delay.

(#fig:stat_new_delay-2)Basic statistics of the added attribute delay.

## 
## [[2]]
## NULL
## 
## [[3]]
Basic statistics of the added attribute delay.

(#fig:stat_new_delay-3)Basic statistics of the added attribute delay.

## 
## [[4]]
Basic statistics of the added attribute delay.

(#fig:stat_new_delay-4)Basic statistics of the added attribute delay.

## 
## [[5]]
Basic statistics of the added attribute delay.

(#fig:stat_new_delay-5)Basic statistics of the added attribute delay.

We also computed some basic statistics for the other added attributes. DOPLNIT CO SME NASLI. Results can be seen on @ref(fig:stat_new)
Basic statistics of the added attributes..

(#fig:stat_new)Basic statistics of the added attributes..

delay_21 delay_140

1.5.3 Missing values

As can be seen on Figure, newly-created features also contain NA values. The highest percentage of missing values has attribute mean_delay_12m, almost 55 %. Together with mean_delay_6m, mean_delay_3m and mean_delay_1m, they are the only new attributes holding NA attributes.

It is not surprising, that these attributes have the highest percentage of NAs, since they compute results only every 12/6/3/1 months.

We decided to replace the NA values by 0, so they can be later used in the modeling part.

## # A tibble: 34 x 3
##    variable               n_miss pct_miss
##    <chr>                   <int>    <dbl>
##  1 mean_delay_12m         804268    55.4 
##  2 mean_delay_6m          458935    31.6 
##  3 mean_delay_3m          249422    17.2 
##  4 mean_delay_1m           95489     6.58
##  5 contract_id                 0     0   
##  6 payment_order               0     0   
##  7 due_date                    0     0   
##  8 living_area                 0     0   
##  9 different_contact_area      0     0   
## 10 kc_flag                     0     0   
## # ... with 24 more rows

1.6 Advanced statistics

paid_amount_gender_prod amount_contract amount_earnings amount_product amount_earnings

dependence_delay_gender amount_gender amount_gender_type_discount delay_gender_

2 Modeling

Predict if the customer will be delayed in payment for 21+ days.

Significantly better than random model AUC > 0.7. In case of highly unbalanced other metrics such as AUCPR etc.

Predict if the customer will be delayed in payment for 140+ days.

Significantly better than random model AUC > 0.7. In case of highly unbalanced other metrics such as AUCPR etc.

Estimate average number of days delayed if the client would exceed the first action.

Better than simple average model by at least 30%.

2.1 Prediction model (21+ days)

2.2 Prediction model (140+ days)

2.3 Estimation of the expected number of days of delay when the client triggers first action

3 Conclusion

4 Discussion

5 Attachments

\label{fig:density}Density plots.

Figure 5.1: Density plots.

\label{fig:boxplots}Boxplots for numeric attributes.

Figure 5.2: Boxplots for numeric attributes.